Objective

  1. Forecast 5 years
  2. Understand likelihood that prod index surpasses 135

Libraries

Importing a few libraries needed for this analysis.

library(tidyverse)
library(forecast)
library(dygraphs)
library(astsa)
library(urca)
library(ggthemr)

Data Import & Preprocessing

There are two pieces of data I was able to pull from the source.

  1. The ‘Industrial Production: Electric and gas utilities’ series, labeled IPG2211A2N
  2. The time periods for recession, as shown in gray in the plot online

These are downloaded as csv files, and some preprocessing is done to shape them in a format ready for modeling.

df <- read_csv("../data/IPG2211A2N.csv",col_names = c("date","ip_index"),skip = 1)
recession <- read_csv("../data/recession.csv",col_names = c("start","end"),skip = 1)

recession_dates = recession %>% 
        mutate(range = map2(start, end, ~seq(from = .x, to = .y, by = "month"))) %>% 
        select(range) %>% 
        unnest(range) %>% 
        pull()
head(df)
head(recession)

The electric and gas index is stored as a time series ts object, since many of the forecasting and plotting functions prefer ts objects to data.frames or tibbles.

egip <- ts(data = df$ip_index,
           start = 1939,
           frequency = 12,
           names = "egip")
head(egip,18)
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 1939 3.3842 3.4100 3.4875 3.5133 3.5133 3.5650 3.5650 3.6167 3.7200 3.7200
## 1940 3.7717 3.8233 3.8492 3.8492 3.8750 3.9267                            
##         Nov    Dec
## 1939 3.7458 3.7458
## 1940

EDA

Let’s visualize the data. The graph shows the monthly values of Electric Gas Industrial Production (EGIP). The gray bars show the recessions. Initial observations:

  1. Not a stationary series. There is a strong trend throughout the time series, with a much steeper slope until ~2009. After the 2009 recession, I can see a much lower, albeit still positive, trend.
  2. Definite presence of periodic behavior over almost the entire time series. ‘Almost’ since pre-1948, there seems to be little/no seasonality at all. (Zoom in to check this.) Another observation is that visually, the seasonality between 1960-1973 is almost constant, i.e. after accounting for the trend, each Jan is roughly equal to each Aug. From 1973 onwards, I see a marked change in this seasonal behavior, with the Jan peaks much higher than the Aug peaks.
  3. Zooming into any one year, we can see a ‘double dip’: EGIP is larger in the Dec-Jan and Jul-Aug time frames, i.e. every 6 months. Presumably, this is due to higher cooling costs at peak summer and higher heating costs at peak winter. This double dip is consistently present throughout the series.
  4. Wherever there are active recessions, I notice that the peaks are a bit lower than in the preceding non-recession years (1974). In some cases, the trend actually changes direction too (1981, 2009).
  5. The variance of the series increases with time. I would want to address this first.
dygraph_plotter <- function(ts_name, recession, title = NULL){
        g <- dygraph(get(ts_name), xlab = "Year",
                     ylab = "Industrial Production: Electric & Gas", main = title)
        for (i in seq_len(nrow(recession))) {
                g <- dyShading(g, from = recession$start[i], to = recession$end[i])
        }
        dyRangeSelector(g)
}
dygraph_plotter("egip", recession)

The season plot displays the time series split by “season” - in our case, one line per year. This plot clearly shows what we’ve already noticed: levels go up over the years, and the seasonality becomes more pronounced.

ggseasonplot(egip)

Zooming into the data from 2000 onward, we can see that - on average - EGIP increases year over year in every month.

ggsubseriesplot(window(egip, start=2000))+labs(y="EGIP")

Modeling Approach

Considering these initial observations, here’s the modeling approach I am proposing:

  1. Split the data into train & test.
  2. For the Training Sets (TS), I have two ideas in mind:
    1. TS1: Keep the entire time series as is and perform some modeling
    2. TS2: Eliminate some of the initial portion of the time series data where the seasonality is non-existent or different
  3. For each type of TS, apply a few models -
    1. Seasonal Naive (to get a baseline performance value),
    2. ARIMA (Auto-Regressive Integrated Moving Average),
    3. Dynamic regression models (regression model with errors following ARIMA model),
    4. ETS (Exponential Smoothing),
    5. STLF (Seasonal decomposition of Time-series using Loess),
    6. Prophet
  4. Select the method which most reduces the prediction error on the test dataset
  5. I’ll also try a weighted approach, combining these different forecasts in an attempt to generate a better point forecast from an ensemble model rather than an individual model

Note:

I investigated approach 2(a) (TS1, the full series) for most of the models listed in 3, and found that they underperformed considerably compared to approach 2(b) (TS2, the truncated series). In particular, they all overestimated the trend component in the forecasts, resulting in consistently biased forecasts. Approach 2(b), on the other hand, worked much better. Thus, in the rest of the report, I will only discuss approach 2(b).

Selection criteria

  1. While model building, we can use information criteria like AIC or BIC to select between candidate models.
  2. Model performance can be judged using metrics like MASE (Mean Absolute Scaled Error).
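Since MASE is the main yardstick later on, here is a minimal base-R sketch (on hypothetical toy data) of how it is computed for a seasonal series; the forecast package’s accuracy() scales in a similar way, using the in-sample seasonal naive MAE:

```r
# MASE = mean(|test error|) / Q, where Q is the in-sample MAE of the
# seasonal naive method (lag m = 12 for monthly data).
mase <- function(actual, fc, train, m = 12) {
  q <- mean(abs(diff(train, lag = m)))  # seasonal naive in-sample MAE
  mean(abs(actual - fc)) / q
}

# Toy monthly series: trend + seasonality (hypothetical numbers)
train  <- rep(1:12, 5) + 0.1 * (0:59)
actual <- 1:12 + 0.1 * (60:71)
fc     <- actual + 1.2        # every error equals one seasonal-naive-sized error
mase(actual, fc, train)       # 1: no better than the naive benchmark
```

A MASE below 1 means the forecast beats the in-sample seasonal naive benchmark, which is how the test-set MASE values in this report are read.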

Train & Test Split

Here, I split the data into training (1984 - 2013) and testing (2014 onward). I’m keeping 5 years in testing since the task is to forecast out 5 years, and matching the test window to the forecast horizon is typical practice. Note - this is where I’m subsetting to data 1984+ only.

egip_train = window(egip, start = 1984, end = 2013+11/12)
egip_test = window(egip, start = 2014)
h = length(egip_test) #forecast length

Transformations

I used BoxCox.lambda to determine the value of lambda for the Box-Cox transformation. Transforming the series using this value, we can see that the variance is much more stable than in the original series. Variable names are prefixed with “t” to indicate transformed series.

lam = BoxCox.lambda(egip_train)
tegip_train = BoxCox(egip_train, lam)
tegip_test = BoxCox(egip_test, lam)
dygraph_plotter("tegip_train", recession, title = paste0("Box Cox Transformed Training Data. Lambda = ", round(lam,4)))
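For reference, the transformation itself is simple. A base-R sketch of the Box-Cox forward and inverse maps for a positive series (matching the formula forecast::BoxCox uses in that case; lambda = 0 reduces to the log), with hypothetical index values:

```r
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
inv_box_cox <- function(z, lambda) {
  if (lambda == 0) exp(z) else (lambda * z + 1)^(1 / lambda)
}

y <- c(3.38, 35.5, 112.7)          # hypothetical index values
z <- box_cox(y, lambda = 0.25)
inv_box_cox(z, lambda = 0.25)      # recovers y: the round trip is exact
```

The inverse map is what is applied (via InvBoxCox) at the very end to bring the final forecasts back to the original index scale.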

Models

Seasonal Naive Forecasts

Running a simplistic seasonal naive forecast first. This can also serve as a baseline - there’s no reason to select anything more complex if a seasonal naive works well. We can see that it works reasonably well, though we do miss out on a few peaks and troughs.

snaiveForecast <- snaive(tegip_train,h = h) 
snaiveForecast %>% autoplot() + autolayer(tegip_test) + 
        scale_x_continuous(limits = c(2010,2019),breaks = seq(2010,2019))+
        scale_y_continuous(limits=c(3,3.25))

snaiveForecast %>% accuracy(tegip_test)
##                       ME       RMSE        MAE        MPE      MAPE
## Training set 0.008147698 0.01772244 0.01434680 0.26860008 0.4692704
## Test set     0.003073970 0.01632939 0.01215974 0.09431929 0.3861878
##                   MASE      ACF1 Theil's U
## Training set 1.0000000 0.5649622        NA
## Test set     0.8475575 0.3158671 0.3809585
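For intuition, the seasonal naive method simply repeats the most recently observed season. A base-R sketch (toy quarterly data, m = 4; snaive() does the same for m = 12 here):

```r
snaive_sketch <- function(y, h, m = 12) {
  last_season <- tail(y, m)                    # most recently observed season
  last_season[((seq_len(h) - 1) %% m) + 1]     # recycle it across the horizon
}

y <- c(10, 12, 15, 11)         # toy quarterly series
snaive_sketch(y, h = 6, m = 4) # 10 12 15 11 10 12
```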

ARIMA

Using the Box-Jenkins framework might prove promising, given the seasonality present in the dataset. The first step is to make the non-stationary data stationary. A quick investigation into how many differences to take can be done using ndiffs and nsdiffs.

Single differencing d = 1

tegip_train %>% 
        diff() %>% 
        dygraph() %>% 
        dyOptions(drawPoints = TRUE, pointSize = 2) %>% 
        dyLimit(mean(diff(tegip_train), na.rm = TRUE), color = "red") %>% 
        dyRangeSelector()

Checking for stationarity: this differenced series is stationary, since the test statistic is below even the 1% critical value.

tegip_train %>% diff(d = 1) %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0483 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Seasonal Difference D[12] = 1

There is still some trending left, and as a result the KPSS test rejects stationarity (the test statistic exceeds the 1% critical value).

tegip_train %>% 
        diff(lag = 12) %>% 
        dygraph() %>% 
        dyOptions(drawPoints = TRUE, pointSize = 3) %>% 
        dyLimit(mean(diff(tegip_train, lag = 12), na.rm = TRUE), color = "red") %>% 
        dyRangeSelector()
tegip_train %>% diff(lag = 12) %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 1.0149 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Seasonal Difference D[12] = 1 followed by simple differencing d=1

Adding a simple difference to the previous result makes the signal stationary; the KPSS test confirms this.

tegip_train %>% 
        diff(lag = 12) %>% 
        diff() %>% 
        dygraph() %>% 
        dyOptions(drawPoints = TRUE, pointSize = 3) %>% 
        dyLimit(mean(diff(diff(tegip_train, lag = 12)), na.rm = TRUE), color = "red") %>% 
        dyRangeSelector()
tegip_train %>% diff(lag = 12) %>% diff() %>% ur.kpss() %>% summary() # Stationary
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0192 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Let’s stick with the seasonal difference (D=1 at lag 12) followed by the simple difference (d=1). I suspect both will be required given the seasonality. The ACF/PACF plots should tell us more.

d1D12tegip_train = diff(diff(tegip_train, lag=12))
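As a quick sanity check on a toy series, the seasonal and ordinary difference operators commute, so the order of the two diff() calls above doesn’t matter:

```r
# (1 - B)(1 - B^12) y  ==  (1 - B^12)(1 - B) y
x <- cumsum(1:48) + rep(sin(2 * pi * (1:12) / 12), 4)  # toy trend + seasonality
a <- diff(diff(x, lag = 12))   # seasonal difference first, then ordinary
b <- diff(diff(x), lag = 12)   # ordinary difference first, then seasonal
all.equal(a, b)                # TRUE
```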

What do the ACF plots tell us?

d1D12_acf = acf2(d1D12tegip_train, plot = T, max.lag = 12*4)

Note: in these graphs, the LAG axis is shown in units of 12 months, so integer LAG values are the seasonal lags.

For the d1D12 data, the seasonal component (integer LAG values) shows a sharp cutoff in the ACF and a slower decay in the PACF. This could indicate an MA(2) for the seasonal component (Q=2). The ‘p’ and ‘q’ components are hard to infer visually; I could guess p=2 and q=2 based on the within-season lags.

So perhaps an SARIMA(2,1,2)(0,1,2)[12] on the Box-Cox transformed time series is a potential model, or perhaps a simpler SARIMA(0,1,1)(0,1,2)[12]. Let’s investigate.

ARIMA MODEL 1

In the SARIMA(0,1,1)(0,1,2)[12] fit, the residuals are quite normal (except a few outliers at the edges). The ACF plot shows a few components outside the limits, as confirmed by the joint Ljung-Box test, which has p-values below 0.05 for all lag values. This means we reject the null hypothesis that the residuals are indistinguishable from white noise.

arimaFit1 <- sarima(xdata = tegip_train,
       p = 0,d = 1,q = 1,
       P = 0,D = 1,Q = 2,S = 12)
## initial  value -4.224828 
## iter   2 value -4.412504
## iter   3 value -4.455195
## iter   4 value -4.466665
## iter   5 value -4.473913
## iter   6 value -4.474783
## iter   7 value -4.474933
## iter   8 value -4.474940
## iter   9 value -4.474941
## iter  10 value -4.474941
## iter  11 value -4.474942
## iter  11 value -4.474942
## iter  11 value -4.474942
## final  value -4.474942 
## converged
## initial  value -4.484203 
## iter   2 value -4.486515
## iter   3 value -4.487276
## iter   4 value -4.487507
## iter   5 value -4.487537
## iter   6 value -4.487538
## iter   7 value -4.487538
## iter   7 value -4.487538
## iter   7 value -4.487538
## final  value -4.487538 
## converged

arimaFit1$fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), include.mean = !no.constant, optim.control = list(trace = trc, 
##     REPORT = 1, reltol = tol))
## 
## Coefficients:
##           ma1     sma1     sma2
##       -0.3967  -0.7494  -0.0656
## s.e.   0.0784   0.0632   0.0633
## 
## sigma^2 estimated as 0.000122:  log likelihood = 1064.8,  aic = -2121.61

ARIMA MODEL 2

SARIMA(2,1,2)(0,1,2)[12] is much more promising. We pass the Ljung-Box test in this instance, and the residuals look good too.

arimaFit2 = sarima(xdata = tegip_train,
       p = 2,d = 1,q = 2,
       P = 0,D = 1,Q = 2,S = 12)
## initial  value -4.239317 
## iter   2 value -4.451477
## iter   3 value -4.472258
## iter   4 value -4.489662
## iter   5 value -4.494134
## iter   6 value -4.507379
## iter   7 value -4.513832
## iter   8 value -4.515320
## iter   9 value -4.516732
## iter  10 value -4.518059
## iter  11 value -4.518925
## iter  12 value -4.519182
## iter  13 value -4.519207
## iter  14 value -4.519219
## iter  15 value -4.519223
## iter  16 value -4.519226
## iter  17 value -4.519228
## iter  18 value -4.519229
## iter  19 value -4.519230
## iter  20 value -4.519230
## iter  21 value -4.519230
## iter  21 value -4.519230
## iter  21 value -4.519230
## final  value -4.519230 
## converged
## initial  value -4.533789 
## iter   2 value -4.540340
## iter   3 value -4.543668
## iter   4 value -4.545936
## iter   5 value -4.546073
## iter   6 value -4.546278
## iter   7 value -4.546301
## iter   8 value -4.546302
## iter   9 value -4.546302
## iter  10 value -4.546303
## iter  11 value -4.546305
## iter  12 value -4.546307
## iter  13 value -4.546309
## iter  14 value -4.546309
## iter  14 value -4.546309
## final  value -4.546309 
## converged

arimaFit2$fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), include.mean = !no.constant, optim.control = list(trace = trc, 
##     REPORT = 1, reltol = tol))
## 
## Coefficients:
##           ar1     ar2      ma1      ma2     sma1     sma2
##       -0.0552  0.2366  -0.3058  -0.5497  -0.7459  -0.0671
## s.e.   0.2374  0.1463   0.2257   0.2101   0.0635   0.0640
## 
## sigma^2 estimated as 0.0001081:  log likelihood = 1085.2,  aic = -2156.4

ARIMA MODEL 3

As a quick check, let’s see what forecast::auto.arima gives us. We get a seasonal AR component (P=2) and a smaller seasonal MA component (Q=1). This is a slightly better model (AIC of -2158, versus -2156 for Model 2 above), and the Ljung-Box test passes as well. Let’s continue with this model.

arimaFit3 <- auto.arima(tegip_train)
arimaFit3
## Series: tegip_train 
## ARIMA(2,1,2)(2,1,1)[12] 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2     sar1     sar2     sma1
##       1.5034  -0.5237  -1.8936  0.8981  -0.0034  -0.1699  -0.7619
## s.e.  0.0741   0.0633   0.0515  0.0485   0.0735   0.0658   0.0583
## 
## sigma^2 estimated as 0.0001091:  log likelihood=1087.22
## AIC=-2158.44   AICc=-2158.02   BIC=-2127.65
arimaFit3 %>% checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2)(2,1,1)[12]
## Q* = 20.245, df = 17, p-value = 0.2619
## 
## Model df: 7.   Total lags used: 24
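As a sanity check on the reported numbers, AIC = 2k − 2 log L, where k counts the estimated parameters (here, the 7 ARMA coefficients plus the innovation variance):

```r
log_lik <- 1087.22   # log likelihood reported by auto.arima above
k <- 7 + 1           # ar1, ar2, ma1, ma2, sar1, sar2, sma1, plus sigma^2
aic <- 2 * k - 2 * log_lik
round(aic, 2)        # -2158.44, matching the printed AIC
```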

This model performs quite well, with consistent results on the test dataset as well.

arimaFit3 %>% forecast(h=h) %>% autoplot() + autolayer(tegip_test) + 
        scale_x_continuous(limits = c(2010,2019),breaks = seq(2010,2019))+
        scale_y_continuous(limits=c(3,3.25))
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.

arimaFit3 %>% forecast(h=h) %>% accuracy(tegip_test)
##                         ME       RMSE        MAE         MPE      MAPE
## Training set -0.0004500184 0.01015022 0.00782325 -0.01461536 0.2552590
## Test set     -0.0042431364 0.01433919 0.01106671 -0.13747594 0.3527257
##                   MASE          ACF1 Theil's U
## Training set 0.5452957 -8.050138e-05        NA
## Test set     0.7713708  3.211888e-01 0.3423345

Dynamic Regression Modeling

Here, I wanted to check whether introducing a regressor, like the presence/absence of recession windows as an indicator variable, improves the ARIMA model in any capacity. I’ve also added each month as an indicator variable. The recession indicator doesn’t appear significant. (I also tried other indicators - number of years since the recession ended, etc. - none were significant.) Although we see a smaller AIC for this model, the residuals do not pass the Ljung-Box test. I’m not continuing with this approach.

recession_end_years = recession %>% mutate(years = lubridate::year(end)) %>% pull(years)
df <- df %>% mutate(rec_indicator = ifelse(date %in% recession_dates,1,0))
months <- as.factor(lubridate::month(df$date))
months <- model.matrix(~months)[,-1]
dfa <- df %>% bind_cols(as_tibble(months))
xreg_mat_train <- as.matrix(dfa[dfa$date>"1983-12-01" & dfa$date<"2014-01-01",-1:-2])
xreg_mat_test <- as.matrix(dfa[dfa$date>"2013-12-01",-1:-2])

arima_xreg_Fit <- auto.arima(tegip_train, stepwise = F,
                             xreg = xreg_mat_train, 
                             trace = F)
arima_xreg_Fit
## Series: tegip_train 
## Regression with ARIMA(2,1,2)(1,0,0)[12] errors 
## 
## Coefficients:
##           ar1     ar2      ma1      ma2    sar1  drift  rec_indicator
##       -0.0181  0.2639  -0.2904  -0.5569  0.1954  7e-04        -0.0007
## s.e.   0.2387  0.1549   0.2256   0.2086  0.0548  1e-04         0.0035
##       months2  months3  months4  months5  months6  months7  months8
##       -0.0321  -0.0626  -0.1091  -0.1116  -0.0719  -0.0365  -0.0348
## s.e.   0.0025   0.0032   0.0035   0.0036   0.0037   0.0037   0.0037
##       months9  months10  months11  months12
##       -0.0767   -0.1097   -0.0936   -0.0317
## s.e.   0.0036    0.0035    0.0032    0.0025
## 
## sigma^2 estimated as 0.0001083:  log likelihood=1138.26
## AIC=-2238.53   AICc=-2236.29   BIC=-2164.74
arima_xreg_Fit %>% checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(2,1,2)(1,0,0)[12] errors
## Q* = 27.606, df = 6, p-value = 0.0001115
## 
## Model df: 18.   Total lags used: 24

ETS

ETS models use exponentially weighted past observations to produce new forecasts. There are quite a few available models, depending on whether the Trend, Seasonal and Error components are Additive, Additive Damped, Multiplicative etc. We can search for the best possible option (based on AICc) using ets. Here, it selects an (M,Ad,M) model - multiplicative errors, additive damped trend, and multiplicative seasonality.

The really small value of beta indicates that the slope changes little. The damping was important to add in this model; without it, the model was completely overshooting the forecast.

etsFit <- ets(tegip_train, damped = TRUE) # damped = NULL lets the model select if damping is needed
etsFit
## ETS(M,Ad,M) 
## 
## Call:
##  ets(y = tegip_train, damped = TRUE) 
## 
##   Smoothing parameters:
##     alpha = 0.5472 
##     beta  = 1e-04 
##     gamma = 0.1807 
##     phi   = 0.9739 
## 
##   Initial states:
##     l = 2.9115 
##     b = 0.0015 
##     s = 1.0088 0.9887 0.9837 0.9944 1.0058 1.0055
##            0.9955 0.9857 0.9897 1.004 1.0147 1.0236
## 
##   sigma:  0.0037
## 
##       AIC      AICc       BIC 
## -1091.922 -1089.916 -1021.972
etsFit %>% forecast(h=h) %>% autoplot()+ autolayer(tegip_test) +
        scale_x_continuous(limits = c(2010,2019),breaks = seq(2010,2019))+
        scale_y_continuous(limits=c(3,3.25)) 

While the training set metrics are only decent (MASE = 0.6), the test set performance is quite poor (MASE of 1.12).

etsFit %>% forecast(h=h) %>% accuracy(tegip_test)
##                         ME       RMSE         MAE         MPE      MAPE
## Training set  0.0008987398 0.01106920 0.008620047  0.02860499 0.2815743
## Test set     -0.0121337058 0.01964737 0.016046373 -0.38978027 0.5128202
##                   MASE      ACF1 Theil's U
## Training set 0.6008339 0.1436723        NA
## Test set     1.1184632 0.4408929 0.4740935
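For intuition about the exponential weighting at the core of ETS, here is a base-R sketch of the simplest member of the family - simple exponential smoothing, with no trend or seasonal terms (ets() estimates alpha and the initial level rather than taking them as given):

```r
ses_level <- function(y, alpha, l0) {
  level <- l0
  for (obs in y) {
    level <- alpha * obs + (1 - alpha) * level  # blend new info with old level
  }
  level  # the flat h-step-ahead forecast
}

ses_level(c(2, 4, 6), alpha = 1, l0 = 0)    # 6: alpha = 1 is the naive method
ses_level(c(5, 5, 5), alpha = 0.5, l0 = 5)  # 5: a constant series stays put
```

A small alpha weights the distant past heavily; alpha near 1 reacts mostly to the latest observation. The same logic applies per component (level, trend, season) in the full ETS models.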

STL

STL, which decomposes a time series into seasonal, trend and irregular components using loess, can be used for forecasting as well. It’s a simple model to execute, with a few tunable parameters like t.window and s.window, which control the width of the windows used for trend and seasonal extraction. t.window indicates how many consecutive points to use for trend extraction; lower numbers allow more flexible trends. s.window indicates how many consecutive years to use when calculating the seasonal components; lower numbers allow the seasonality to change more quickly.

I’ve played around with the numbers a bit to get to a state I liked. The trend is fairly smooth. The seasonality captures some interesting insights. In 1985, there is a predominant “W” shape to the seasonality… as we go towards 2015, it’s more like a “double-U” shape.

stlFit <- tegip_train %>% stl(t.window = 15, s.window = 5)
autoplot(stlFit)

Do the residuals of this STL decomposition still carry information? YES. The ACF shows significant lags which indicates the residuals are not white noise.

checkresiduals(stlFit$time.series[,"remainder"])

Forecasting with this method, the seasonal component is forecast using a seasonal naive approach, while the seasonally adjusted remainder is forecast using a random walk. We can tell that this method is worse than simply using the seasonal naive model.

stlForecast <- stlFit %>% forecast(method = "naive", h = h) 
stlForecast %>% autoplot() + autolayer(tegip_test) + scale_x_continuous(limits = c(2010,2019)) + scale_y_continuous(limits=c(3,3.25))

stlForecast %>% accuracy(tegip_test)
##                         ME        RMSE         MAE         MPE      MAPE
## Training set  0.0006738193 0.007321679 0.005832362  0.02191946 0.1906848
## Test set     -0.0116026975 0.018674608 0.015457999 -0.37237723 0.4935960
##                   MASE       ACF1 Theil's U
## Training set 0.4065269 -0.1948613        NA
## Test set     1.0774524  0.4039056 0.4503383
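For intuition, the decomposition stl() performs can be sketched via its classical ancestor in base R: extract the trend with a centered moving average, then average the detrended values by month (stl() replaces the moving average with loess and lets the seasonality evolve). Toy data with a fixed seasonal pattern:

```r
m <- 12
t_idx <- 1:120
y <- 0.05 * t_idx + rep(sin(2 * pi * (1:12) / 12), 10)  # toy trend + seasonality

# Centered 2x12 moving average extracts the trend (NA at the edges)
trend <- stats::filter(y, c(0.5, rep(1, m - 1), 0.5) / m, sides = 2)

# Average detrended values by calendar month for the seasonal component
month <- (t_idx - 1) %% m + 1
seasonal <- tapply(y - trend, month, mean, na.rm = TRUE)
seasonal <- seasonal - mean(seasonal)                   # center at zero

remainder <- y - trend - seasonal[month]
max(abs(remainder), na.rm = TRUE)   # ~0: this toy series decomposes exactly
```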

Prophet

Prophet is a forecasting tool implemented by our colleagues at Facebook, which can perform quite sophisticated forecasts by considering a large number of input & tunable parameters. I’ll be honest - I’ve only used it a handful of times at work, to test some of its outlier resistance. So it’s quite a black box for me. I’m using it here to test how well it does “out of the box”, but I won’t spend any time tuning it.

library(prophet)
proph_df_train <- tibble(ds = seq(as.Date("1984-01-01"),as.Date("2013-12-01"),by = "month"),
                         y = as.numeric(tegip_train))
m <- prophet(proph_df_train)
## Initial log joint probability = -2.07089
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
future <- make_future_dataframe(m, periods = h, freq = "month")
forecast <- predict(m, future)
dyplot.prophet(m, forecast)
prophet_forecast <- ts(forecast$yhat[forecast$ds>"2013-12-01"], start = start(tegip_test), frequency = frequency(tegip_test))
Q = sum(abs(diff(tegip_test,lag = frequency(tegip_test))))/(length(tegip_test)-frequency(tegip_test))
MASE = mean(abs(tegip_test - prophet_forecast))/Q
RMSE = sqrt(mean((tegip_test - prophet_forecast)^2))
prophet_yhat = ts.intersect(prophet_forecast, tegip_test)
plot(prophet_yhat,plot.type="single",col=1:2,ylab="EGIP", sub="Red: Actual, Black: Forecast",
     main = paste0("Test Set RMSE: ",round(RMSE,3), "   ", "Test Set MASE: ", round(MASE,3)))

The residuals for this model are decent. I see a few locations where the ACF and PACF cross the limits, and the Ljung-Box test does fail at the 0.05 level.

prophet_residuals = prophet_yhat[,1]-prophet_yhat[,2]
acf_prophet = acf2(prophet_residuals)

Box.test(prophet_residuals,lag = 36, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  prophet_residuals
## X-squared = 56.571, df = 36, p-value = 0.01581
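For reference, the Ljung-Box statistic that Box.test reports can be written out by hand. A base-R sketch on simulated white noise, checked against stats::Box.test:

```r
# Q* = n (n + 2) * sum_{k=1..K} acf_k^2 / (n - k); a large Q* (small p-value)
# means the residual autocorrelations are jointly too large for white noise.
set.seed(42)
e <- rnorm(200)                  # simulated white-noise "residuals"
n <- length(e)
K <- 10
rho <- acf(e, lag.max = K, plot = FALSE)$acf[-1]   # sample ACF, lags 1..K
Q <- n * (n + 2) * sum(rho^2 / (n - seq_len(K)))
all.equal(Q, unname(Box.test(e, lag = K, type = "Ljung-Box")$statistic))  # TRUE
```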

Ensemble

There are a few packages that perform ensemble time-series modeling, like forecastHybrid, though I haven’t used them much. However, the following “benchmarking” code (obtained from here) is something Hyndman wrote which beautifully takes a few models, runs all combinations of simple averaging across their forecast values, and returns the best possible combination. Often, an average of forecasts outperforms any single model.

Here, I’m combining Seasonal Naive, ETS, ARIMA, STL and Prophet forecasts. In total, this code will check 31 combinations. Each letter here corresponds to a model. Thus, “NE” = Average of “SNaive” & “ETS”.

N = Seasonal Naive, E = ETS, A = ARIMA, S = STL, P = Prophet

benchmarks <- function(y, h) {
  require(forecast)
  # Compute the five benchmark methods
  fcasts <- rbind(
    N = snaive(y, h)$mean,
    E = forecast(ets(y), h)$mean,
    A = forecast(auto.arima(y), h)$mean,
    S = stlf(y, h = h)$mean,
    P = prophet_forecast) # Prophet forecast computed earlier
  colnames(fcasts) <- seq(h)
  method_names <- rownames(fcasts)
  # Compute all possible combinations
  method_choice <- rep(list(0:1), length(method_names))
  names(method_choice) <- method_names
  combinations <- expand.grid(method_choice) %>% tail(-1) %>% as.matrix()
  # Construct names for all combinations
  for (i in seq(NROW(combinations))) {
    rownames(combinations)[i] <- paste0(method_names[which(combinations[i, ] > 0)], 
      collapse = "")
  }
  # Compute combination weights
  combinations <- sweep(combinations, 1, rowSums(combinations), FUN = "/")
  # Compute combinations of forecasts
  return(combinations %*% fcasts)
}
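The combination-weighting trick inside benchmarks() is worth unpacking. A toy example with two methods and h = 3 forecast periods:

```r
fcasts <- rbind(A = c(1, 2, 3), B = c(3, 4, 5))               # toy forecasts
combos <- as.matrix(tail(expand.grid(A = 0:1, B = 0:1), -1))  # drop all-zero row
combos <- sweep(combos, 1, rowSums(combos), FUN = "/")        # rows sum to 1
combos %*% fcasts
# Rows: A alone, B alone, and the equal-weight average (2, 3, 4)
```

With 5 methods this yields the 2^5 - 1 = 31 combinations checked below.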

benchmarks_out <- benchmarks(tegip_train, h)

This plot shows ALL the 31 combinations together - not too much insight here; quite an eye chart.

ts(t(benchmarks_out),start=start(egip_test),frequency = 12) %>% autoplot() + autolayer(tegip_test,color='black',lty=2)

If I pick MAE as the performance metric to compare these models, here I plot the distribution of test-set MAE for each model, sorted by the median MAE value. The top three models here are AP, NA and NAP. Let’s plot them in a time series plot next.

ggthemr("fresh")
benchmarks_df = t(benchmarks_out) %>% as_tibble()
benchmarks_df$y = tegip_test
benchmarks_df = benchmarks_df[,32] %>% bind_cols(benchmarks_df[,-32])
#MAE
mean_MAE = benchmarks_df[,-1] %>% 
        map_df(~abs(.x - benchmarks_df$y)) %>% 
        gather() %>% 
        group_by(key) %>% 
        summarize(m = median(value)) %>% 
        mutate(key = factor(key, levels = key[order(m)]))
benchmarks_df[,-1] %>% 
        map_df(~abs(.x - benchmarks_df$y)) %>% 
        gather() %>% 
        mutate(key = factor(key, levels = levels(mean_MAE$key))) %>% 
        ggplot(aes(key, value)) + geom_boxplot() + coord_flip() + 
        labs(x = "Model", y = "Test Set MAE")

ggthemr_reset()

Visually, these are extremely similar in performance. I’m picking the “AP” model for the final forecasts.

ts(t(benchmarks_out[c("AP","NA","NAP"),]),start=start(egip_test),frequency = 12) %>% autoplot() + autolayer(tegip_test,color='black',lty=2)

Final Point Estimate

Running the Prophet and ARIMA models again on the full data (train + test), and averaging their forecasts to produce the 5-year forecast beyond Jan-2019.

tegip = BoxCox(window(egip,start=1984), lam)

# Prophet
proph_df <- tibble(ds = seq(as.Date("1984-01-01"),as.Date("2019-01-01"),by = "month"),
                         y = as.numeric(tegip))
m <- prophet(proph_df)
future <- make_future_dataframe(m, periods = 12*5, freq = "month")
forecast <- predict(m, future)
yhat_P <- tail(forecast$yhat,60)

# ARIMA
yhat_A = Arima(tegip,c(2,1,2),c(2,1,1)) %>% forecast(h=60) %>% .$mean

final_yhat = (yhat_P+yhat_A)/2
final_yhat = InvBoxCox(final_yhat, lam)

autoplot(egip)+autolayer(final_yhat)+
        scale_x_continuous(limits = c(2010,2024),breaks = seq(2010,2024))+
        theme(legend.position = "none")+
        labs(y="EGIP")

What is the likelihood that the index passes 135, at least once?

One of the pitfalls of the ensemble approach is the estimation of the prediction intervals. There could be a way to analytically compute and combine the PIs for the two approaches, though I would have to look into this.

Another, computational, way to do this is to calculate the intervals using bootstrapping on simulated time series. Hyndman describes this here. Using bootstrapping techniques, he first creates many similar representative time series, then applies a model like ETS / ARIMA to each series, and finally obtains a heuristic prediction interval.

I haven’t done this method before, but I want to learn more about it. So I’ll note it down as a potential future improvement item to this task. [I tried using it here, but it requires creation of a fitted_model object for simulate.]

For now, I’m going to use just one model, the ARIMA model, and get its prediction intervals. These are shown below. The dashed line shows the 135 cutoff in question. We can see that in Jan, for years 2022, 2023 and 2024, there is a possibility that the point estimate may cross 135. For Jan-2024, for example, the model says we should expect the mean estimate to be between 3.17 and 3.27 (at 95% confidence); this interval contains the threshold of 3.25 (the Box-Cox-transformed 135).

The prediction intervals created by auto.arima are quite optimistic: the model isn’t accounting for the variation in the predictions due to parameter estimation uncertainty.

Arima(tegip,c(2,1,2),c(2,1,1)) %>% forecast(h=60) %>% autoplot()+
        geom_hline(yintercept = BoxCox(135,lam), lty=2) + 
        scale_x_continuous(limits = c(2017,2024),breaks = seq(2017,2024))+
        labs(y="EGIP", caption="The 135 limit is shown after Box-Cox transformation here.")

Visually, we can plot the distribution of mean estimates for Jan-2024. The area to the right of the red line is the probability of the mean exceeding the threshold - in this case, 8.6%.

sigma_h = 0.02440663 #Calculated from the forecast object using y_PI = y_hat_mean + 1.96 * sigma_h
mu = 3.221144
thr = 3.254454
hist(rnorm(1000, mu, sigma_h))
abline(v=thr, col="red")

#1-pnorm(mean = mu, sd = sigma_h, q = thr)
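Unpacking the 8.6% figure: under the normal approximation, it is just the upper-tail probability at the threshold, using the mean and sigma_h pulled from the forecast object above:

```r
sigma_h <- 0.02440663   # forecast standard deviation for Jan-2024 (from above)
mu      <- 3.221144     # point forecast (Box-Cox scale)
thr     <- 3.254454     # Box-Cox transformed 135
p_exceed <- 1 - pnorm(thr, mean = mu, sd = sigma_h)
round(100 * p_exceed, 1)   # 8.6 (%)
```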

Final observations & Communication to the team

  1. The point forecasts for the test dataset did quite well, for the ensemble models (like AP, NA and NAP) as well as for the pure ARIMA models. I did miss some troughs (like Apr 2016, Apr 2017) and some high peaks (like Jan 2018). It’s possible I can improve these forecasts if I include more information. Perhaps temperature data, or some other measure of activity/consumption.
  2. I want to investigate further the utilization of bootstrapping to determine the prediction intervals, especially for these types of ensemble models
  3. Many of these models show promise. For a 1st review with the stakeholders, we should proceed with one of the forecasts which can be explained (like the ARIMA, or Seasonal Naive). Once we get buy-in from the stakeholders to the potential of delivering accurate forecasts, we can introduce to them the more complex methods (like ensembling), assuming that forecasting accuracy (not explainability) is the most important performance metric.